NSF PAR Search | NSF Public Access Repository

Bridging the Gap between Sequence and Structure Classifications of Proteins with AlphaFold Models

https://doi.org/10.1016/j.jmb.2024.168764

Pei, Jimin; Andreeva, Antonina; Chuguransky, Sara; Lázaro Pinto, Beatriz; Paysan-Lafosse, Typhaine; Schaeffer, R Dustin; Bateman, Alex; Cong, Qian; Grishin, Nick V (August 2024, Journal of Molecular Biology)

Classification of domains in predicted structures of the human proteome

https://doi.org/10.1073/pnas.2214069120

Schaeffer, R. Dustin; Zhang, Jing; Kinch, Lisa N.; Pei, Jimin; Cong, Qian; Grishin, Nick V. (March 2023, Proceedings of the National Academy of Sciences)

Recent advances in protein structure prediction have generated accurate structures of previously uncharacterized human proteins. Identifying domains in these predicted structures and classifying them into an evolutionary hierarchy can reveal biological insights. Here, we describe the detection and classification of domains from the human proteome. Our classification indicates that only 62% of residues are located in globular domains. We further classify these globular domains and observe that the majority (65%) can be classified among known folds by sequence, with a smaller fraction (33%) requiring structural data to refine the domain boundaries and/or to support their homology. A relatively small number (966 domains) cannot be confidently assigned using our automatic pipelines, thus demanding manual inspection. We classify 47,576 domains, of which only 23% have been included in experimental structures. A portion (6.3%) of these classified globular domains lack sequence-based annotation in InterPro. A quarter (23%) have not been structurally modeled by homology, and they contain 2,540 known disease-causing single amino acid variations whose pathogenesis can now be inferred using AF models. A comparison of classified domains from a series of model organisms revealed expansions of several immune response-related domains in humans and a depletion of olfactory receptors. Finally, we use this classification to expand well-known protein families of biological significance. These classifications are presented on the ECOD website (http://prodata.swmed.edu/ecod/index_human.php).
Case Studies of Orphan Domain Reclassification in ECOD by Expert Curation

https://doi.org/10.1002/prot.26840

Pei, Jimin; Schaeffer, R Dustin; Cong, Qian; Grishin, Nick V (May 2025, Proteins: Structure, Function, and Bioinformatics)

Homology‐based protein domain classification is a powerful tool for gaining biological insights into protein function. This classification process has been significantly enhanced by the availability of experimental structures and high‐accuracy structural models generated by advanced tools such as AlphaFold. Our Evolutionary Classification of protein Domains (ECOD) database provides a continuously updated and refined domain classification system. Isolated (“orphan”) protein domain families, which have a limited distribution in the protein universe, present a unique challenge in this classification process. These families lack clear or identifiable evolutionary relationships with other sequence families. While some isolated domain families may have emerged through de novo evolution, others potentially share common evolutionary origins with existing domain families but represent difficult cases for traditional classification methods. In this study, we conducted a manual analysis of a set of isolated families of small domains in ECOD. By exploring sequence, structural, and functional evidence, we uncovered distant members and likely homologous relationships between different isolated domain families that were previously unrecognized. Our analysis provides valuable insights into the evolution of isolated domain families and has led to improved classification within ECOD. This work enhances our understanding of protein evolution and underscores the importance of continuous refinement in domain classification systems as new data and analytical methods become available.
Refinement and curation of homologous groups facilitated by structure prediction

https://doi.org/10.1002/pro.70074

Schaeffer, Richard Dustin; Pei, Jimin; Zhang, Jing; Cong, Qian; Grishin, Nick V (February 2025, Protein Science)

Domain classification of protein predictions released in the AlphaFold Database (AFDB) has been a recent focus of the Evolutionary Classification of protein Domains (ECOD). Although a primary focus of our recent work has been the partition and assignment of domains from these predictions, we here show how these diverse predictions can be used to examine the reference domain set more closely. Using results from DPAM, our AlphaFold‐specific domain parsing algorithm, we examine hierarchical groupings that share significant levels of homologous links, both between groups that were not previously assessed to be definitively homologous and between groups that were not previously observed to share significant homologous links. Combined with manual analysis, these large datasets of structural and sequence similarities allow us to merge homologous groups in multiple cases, which we detail within. The merged groups tend to be domain families that are either small, previously had few experimental representatives, or had unknown function. The exception to this is the chromodomains, a large homologous group whose status was raised from “possibly homologous” to “definitely homologous” to increase the consistency of ECOD, based on their strong homologous links to the SH3 domains.
TMEM120A is a coenzyme A-binding membrane protein with structural similarities to ELOVL fatty acid elongase

https://doi.org/10.7554/eLife.71220

Xue, Jing; Han, Yan; Baniasadi, Hamid; Zeng, Weizhong; Pei, Jimin; Grishin, Nick V; Wang, Junmei; Tu, Benjamin P; Jiang, Youxing (August 2021, eLife)

TMEM120A, also named TACAN, is a novel membrane protein highly conserved in vertebrates and was recently proposed to be a mechanosensitive channel involved in sensing mechanical pain. Here we present the single-particle cryogenic electron microscopy (cryo-EM) structure of human TMEM120A, which forms a tightly packed dimer with extensive interactions mediated by the N-terminal coiled coil domain (CCD), the C-terminal transmembrane domain (TMD), and the re-entrant loop between the two domains. The TMD of each TMEM120A subunit contains six transmembrane helices (TMs) and has no clear structural feature of a channel protein. Instead, the six TMs form an α-barrel with a deep pocket where a coenzyme A (CoA) molecule is bound. Intriguingly, some structural features of TMEM120A resemble those of elongase for very long-chain fatty acids (ELOVL) despite the low sequence homology between them, pointing to the possibility that TMEM120A may function as an enzyme for fatty acid metabolism, rather than a mechanosensitive channel.
Pathogenic mutation hotspots in protein kinase domain structure

https://doi.org/10.1002/pro.4750

Medvedev, Kirill E.; Schaeffer, R. Dustin; Pei, Jimin; Grishin, Nick V. (August 2023, Protein Science)

Control of eukaryotic cellular function is heavily reliant on the phosphorylation of proteins at specific amino acid residues, such as serine, threonine, tyrosine, and histidine. Protein kinases that are responsible for this process comprise one of the largest families of evolutionarily related proteins. Dysregulation of protein kinase signaling pathways is a frequent cause of a large variety of human diseases including cancer, autoimmune, neurodegenerative, and cardiovascular disorders. In this study, we mapped all pathogenic mutations in 497 human protein kinase domains from the ClinVar database to the reference structure of Aurora kinase A (AURKA) and grouped them by the relevance to the disease type. Our study revealed that the majority of mutation hotspots associated with cancer are situated within the catalytic and activation loops of the kinase domain, whereas non‐cancer‐related hotspots tend to be located outside of these regions. Additionally, we identified a hotspot at residue R371 of the AURKA structure that has the highest number of exclusively non‐cancer‐related pathogenic mutations (21) and has not been previously discussed.
Evolution of a chordate-specific mechanism for myoblast fusion

https://doi.org/10.1126/sciadv.add2696

Zhang, Haifeng; Shang, Renjie; Kim, Kwantae; Zheng, Wei; Johnson, Christopher J.; Sun, Lei; Niu, Xiang; Liu, Liang; Zhou, Jingqi; Liu, Lingshu; et al (September 2022, Science Advances)

Muscle fusogens in tunicates and lampreys shed new light on the evolution and developmental mechanism of muscle multinucleation.
Computed structures of core eukaryotic protein complexes

https://doi.org/10.1126/science.abm4805

Humphreys, Ian R.; Pei, Jimin; Baek, Minkyung; Krishnakumar, Aditya; Anishchenko, Ivan; Ovchinnikov, Sergey; Zhang, Jing; Ness, Travis J.; Banjade, Sudeep; Bagde, Saket R.; et al (December 2021, Science)

Protein-protein interactions play critical roles in biology, but the structures of many eukaryotic protein complexes are unknown, and there are likely many interactions not yet identified. We take advantage of advances in proteome-wide amino acid coevolution analysis and deep-learning–based structure modeling to systematically identify and build accurate models of core eukaryotic protein complexes within the Saccharomyces cerevisiae proteome. We use a combination of RoseTTAFold and AlphaFold to screen through paired multiple sequence alignments for 8.3 million pairs of yeast proteins, identify 1505 likely to interact, and build structure models for 106 previously unidentified assemblies and 806 that have not been structurally characterized. These complexes, which have as many as five subunits, play roles in almost all key processes in eukaryotic cells and provide broad insights into biological function.